

fa2246fa0fdf0d3e270c86767b77ba1b-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their careful reading of and feedback on our submission. We did not pursue theoretical results for PIA because of its lackluster empirical performance. In Line 99, we will change "gradient" to "subgradient". The definitions of interpolation we use are given in [3]. We cap the iterations in the simulations at 1000; we will note this in the final version of the paper.





Reviews: A theory on the absence of spurious solutions for nonconvex and nonsmooth optimization

Neural Information Processing Systems

This paper studies conditions for the absence of spurious optimality. In particular, the authors introduce 'global functions' to define the set of continuous functions that admit no spurious local optima (in the sense of sets), and develop corresponding definitions and propositions that extend this characterization to continuous functions admitting no spurious strict local optima. The authors also apply their theory to l1-norm minimization in tensor decomposition. Pros: In my opinion, the main contribution of this paper is to establish a general mathematical result and apply it to study the absence of spurious optimality for a specific problem. I also find some of the mathematical discoveries on global functions interesting, which include: -- In section 2, the paper provides two examples to show that: (i).
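As an illustrative aside (not the paper's formal definition of 'global functions'), the 'no spurious local optima' property can be checked numerically in one dimension: every local minimizer found on a grid should attain the global minimum value. A minimal sketch, with two hypothetical test functions:

```python
def local_minima(f, lo, hi, n=2001):
    """Grid approximation of the local minimizers of f on [lo, hi] (interior points only)."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    ys = [f(x) for x in xs]
    return [(xs[i], ys[i]) for i in range(1, n - 1)
            if ys[i] <= ys[i - 1] and ys[i] <= ys[i + 1]]

# |x| has a unique local minimum, which is global: no spurious local optima.
mins_abs = local_minima(abs, -2.0, 2.0)

# x^4 - x^2 + 0.3*x has two local minima with different values:
# the shallower one is a spurious local optimum.
f = lambda x: x**4 - x**2 + 0.3 * x
mins_poly = local_minima(f, -2.0, 2.0)
```

This only illustrates the concept on a grid; the paper's results are stated for continuous functions and sets of optima.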


Reviews: Provably Correct Automatic Sub-Differentiation for Qualified Programs

Neural Information Processing Systems

In this submission, the authors consider the problem of automatically and correctly computing sub-derivatives for a class of non-smooth functions. They give a very nice example illustrating problems with current automatic differentiation frameworks, such as TensorFlow and PyTorch. Then, the authors prove a chain rule for the one-sided directional derivative of a composite non-smooth function satisfying certain assumptions. Based on this rule, the authors derive a (randomized) algorithm for computing such derivatives for a particular class of programs with only constant overhead. The algorithm is very similar to backward automatic differentiation, except that its forward computation is based on the newly proved chain rule from the submission rather than the standard chain rule for differentiation.
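The failure mode this review refers to can be reproduced without any framework. A minimal sketch (illustrative, not the authors' algorithm): forward-mode AD with dual numbers, where relu uses the common convention relu'(0) = 0; on f(x) = relu(x) - relu(-x), which equals the identity function, AD then reports derivative 0 at x = 0 instead of the true value 1.

```python
# Tiny forward-mode AD with dual numbers (value, derivative).
class Dual:
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __sub__(self, o):
        return Dual(self.val - o.val, self.dot - o.dot)
    def __neg__(self):
        return Dual(-self.val, -self.dot)

def relu(x):
    # The subgradient choice relu'(0) = 0 zeroes the derivative at the kink,
    # matching the convention used by TensorFlow and PyTorch.
    return Dual(max(x.val, 0.0), x.dot if x.val > 0 else 0.0)

def f(x):
    # f(x) = relu(x) - relu(-x) is the identity, so f'(x) = 1 everywhere.
    return relu(x) - relu(-x)

y = f(Dual(0.0, 1.0))
print(y.val, y.dot)  # AD reports derivative 0.0 at x = 0; the true derivative is 1.
```

Away from the kink the usual chain rule is fine (f'(2) = 1 is computed correctly); the inconsistency arises only at nondifferentiable points, which is exactly the regime the submission addresses.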




Pointwise convergence theorem of gradient descent in sparse deep neural network

Yoneda, Tsuyoshi

arXiv.org Artificial Intelligence

The theoretical structure of deep neural networks (DNNs) has been clarified gradually. Imaizumi-Fukumizu (2019) and Suzuki (2019) showed that the learning ability of DNNs is superior to previous theories when the target functions are non-smooth. However, as far as the author is aware, none of the numerous works to date has attempted to mathematically investigate which DNN architectures actually induce pointwise convergence of gradient descent (without any statistical argument), an attempt that seems closer to practical DNNs. In this paper we restrict the target functions to non-smooth indicator functions, and construct a deep neural network for which the gradient descent process in a ReLU-DNN induces pointwise convergence. The DNN has a sparse, special shape with certain variable transformations.
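A much simpler setting than the paper's (hypothetical illustration, not the paper's construction): fix two sparse ReLU "ramp" features that bracket the interval [0, 1], and run plain gradient descent only on the output weights to fit the indicator of [0, 1]. With the features frozen, the squared loss is convex in the weights, so gradient descent converges; all constants below are arbitrary choices.

```python
import random

relu = lambda t: max(t, 0.0)
# Two sparse ReLU features: narrow ramps rising from 0 to 1 at x = 0 and x = 1.
feats = lambda x: [relu(20 * x) - relu(20 * x - 1),
                   relu(20 * (x - 1)) - relu(20 * (x - 1) - 1)]
target = lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0  # indicator of [0, 1]

random.seed(0)
xs = [random.uniform(-1.0, 2.0) for _ in range(200)]

w = [0.0, 0.0]
for _ in range(500):  # gradient descent on the mean squared error
    g = [0.0, 0.0]
    for x in xs:
        phi = feats(x)
        err = w[0] * phi[0] + w[1] * phi[1] - target(x)
        g[0] += 2 * err * phi[0] / len(xs)
        g[1] += 2 * err * phi[1] / len(xs)
    w = [w[0] - 0.5 * g[0], w[1] - 0.5 * g[1]]

model = lambda x: w[0] * feats(x)[0] + w[1] * feats(x)[1]
```

The learned weights land near (1, -1), so the model is approximately ramp-up-at-0 minus ramp-up-at-1, i.e. the indicator up to the ramp width. The paper instead trains the full network and proves pointwise convergence for a specially structured sparse architecture.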


Adaptive Gradient Methods for Constrained Convex Optimization

Ene, Alina, Nguyen, Huy L., Vladu, Adrian

arXiv.org Machine Learning

Gradient methods are a fundamental building block of modern machine learning. Their scalability and small memory footprint make them exceptionally well suited to the massive volumes of data used for present-day learning tasks. While such optimization methods perform very well in practice, one of their major limitations is their inability to converge faster by taking advantage of specific features of the input data. For example, the training data used for classification tasks may exhibit a few very informative features, while all the others have only marginal relevance. Having access to this information a priori would enable practitioners to appropriately tune first-order optimization methods, thus allowing them to train much faster. Lacking this knowledge, one may attempt to reach a similar performance by very carefully tuning hyper-parameters, which are all specific to the learning model and input data. This limitation has motivated the development of adaptive methods, which, in the absence of prior knowledge concerning the importance of various features in the data, adapt their learning rates based on the information acquired in previous iterations. The most notable example is AdaGrad [13], which adaptively modifies the learning rate corresponding to each coordinate in the vector of weights. Following its success, a host of new adaptive methods appeared, including Adam [17], AmsGrad [27], and Shampoo [14], which attained optimal rates for generic online learning tasks.
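The per-coordinate adaptation described above can be sketched in a few lines. This is a minimal diagonal AdaGrad (illustrative only: the test objective and constants are hypothetical, and the projection step needed for the constrained case studied in the paper is omitted).

```python
import math

def adagrad(grad, x0, lr=0.5, eps=1e-8, steps=200):
    """Diagonal AdaGrad: coordinate i takes steps of size lr / sqrt(sum of g_i^2)."""
    x = list(x0)
    accum = [0.0] * len(x)  # running sum of squared gradients, per coordinate
    for _ in range(steps):
        g = grad(x)
        for i in range(len(x)):
            accum[i] += g[i] ** 2
            x[i] -= lr * g[i] / (math.sqrt(accum[i]) + eps)
    return x

# Hypothetical ill-conditioned quadratic: f(x) = 50*x0^2 + 0.5*x1^2.
grad = lambda x: [100.0 * x[0], 1.0 * x[1]]
x = adagrad(grad, [1.0, 1.0])
```

Note that both coordinates make similar per-step progress despite the 100x difference in curvature; that automatic rescaling is the point of the method.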


Deep Neural Networks Learn Non-Smooth Functions Effectively

Imaizumi, Masaaki, Fukumizu, Kenji

arXiv.org Machine Learning

We theoretically discuss why deep neural networks (DNNs) perform better than other models in some cases by investigating statistical properties of DNNs for non-smooth functions. While DNNs have empirically shown higher performance than other standard methods, understanding the mechanism is still a challenging problem. From the perspective of statistical theory, it is known that many standard methods attain optimal convergence rates, and thus it has been difficult to find theoretical advantages of DNNs. This paper fills this gap by considering learning of a certain class of non-smooth functions, which was not covered by the previous theory. We derive convergence rates of estimators by DNNs with a ReLU activation, and show that the estimators by DNNs are almost optimal for estimating the non-smooth functions, while some of the popular models do not attain the optimal rate. In addition, our theoretical result provides guidelines for selecting an appropriate number of layers and edges of DNNs. We provide numerical experiments to support the theoretical results.
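As a small aside (an elementary identity, not the paper's estimator): ReLU networks can represent simple non-smooth functions exactly, which is one intuition for why they handle non-smooth targets well. For example, |x| is a two-hidden-unit ReLU network:

```python
# |x| is non-smooth at 0, yet a two-unit ReLU layer represents it exactly:
# |x| = relu(x) + relu(-x).
relu = lambda t: max(t, 0.0)
abs_net = lambda x: relu(x) + relu(-x)

assert all(abs_net(x) == abs(x) for x in [-2.0, -0.5, 0.0, 1.5])
```

Smooth models such as polynomial or kernel regressors can only approximate the kink at 0, which is the kind of gap the paper's rate comparison makes precise.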


Communication Complexity of Distributed Convex Learning and Optimization

Arjevani, Yossi, Shamir, Ohad

Neural Information Processing Systems

We study the fundamental limits to communication-efficient distributed methods for convex learning and optimization, under different assumptions on the information available to individual machines, and the types of functions considered. We identify cases where existing algorithms are already worst-case optimal, as well as cases where room for further improvement is still possible. Among other things, our results indicate that without similarity between the local objective functions (due to statistical data similarity or otherwise) many communication rounds may be required, even if the machines have unbounded computational power.